Project presentation

Ander Barrio Campos (231938), Dionysios Dimitreas (s232752), Erikas Mikužis (s223164), Valeria Tedeschi (s231945), Angeliki Vliora (233059)

Introduction

Tidyverse: enhances data manipulation and visualization with a tidy data workflow, fostering code that is

  • readable
  • maintainable
  • reproducible

Packages: ggplot2, dplyr, tidyr, readr (core tidyverse) and broom

Our Dataset

Source: Behavioral Risk Factor Surveillance System (BRFSS) 2015.

Key Features: Health indicators related to diabetes, including:

  • lifestyle factors
  • health outcomes
  • demographic information

Introduction

Research Questions

  1. What are the key predictive variables in diabetes prognosis?

  2. How does gender influence the manifestation and progression of diabetes?

Materials and Methods

Flowchart (quick explanation of how we plan to answer the research questions)

Data Cleaning and Augmentation

Data Cleaning

  • Removed Missing Values: df_cleaned <- df |> drop_na()

  • Verified Data Types: column_types <- summarise(df_cleaned, across(everything(), class))

  • Filtered Incorrect Values: Filtered out rows with values outside expected ranges.
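A minimal sketch of the three cleaning steps on toy data. The column names, the allowed values for Smoker, and the BMI range are assumptions for illustration, not the project's actual rules.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the BRFSS extract (hypothetical values)
df <- tibble(
  Diabetes_binary = c(0, 1, 0, 1, NA),
  BMI             = c(24, 31, 250, 28, 27),
  Smoker          = c(0, 1, 1, 2, 0)
)

df_cleaned <- df |>
  drop_na() |>                 # remove rows with any missing value
  filter(
    Smoker %in% c(0, 1),       # binary flag must be 0 or 1 (assumed coding)
    between(BMI, 10, 100)      # assumed plausible BMI range
  )

# One class label per column; [1] keeps a single label if a column has several classes
column_types <- df_cleaned |>
  summarise(across(everything(), ~ class(.x)[1]))
```

Here the row with the missing target, the row with BMI 250, and the row with Smoker 2 are dropped.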

Data Augmentation

  • Transformed Variables: Binary to categorical (e.g., Smoker to Smoking Status).

  • Created New Variables: E.g., Habits, Health Risk, based on lifestyle and health indicators.

  • Socio-Economic Class: Derived from income, education, and healthcare status.
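The augmentation steps could look roughly like this. The labels and the rule that defines Habits are hypothetical; the project's actual definitions may differ.

```r
library(dplyr)

# Toy data with assumed 0/1 coding for both indicators
df_cleaned <- tibble(
  Smoker       = c(0, 1, 1),
  PhysActivity = c(1, 0, 1)
)

df_aug <- df_cleaned |>
  mutate(
    # Binary flag turned into a readable category
    Smoking_Status = if_else(Smoker == 1, "Smoker", "Non-smoker"),
    # New variable combining two lifestyle indicators (illustrative rule)
    Habits = case_when(
      Smoker == 0 & PhysActivity == 1 ~ "Healthy",
      Smoker == 1 & PhysActivity == 0 ~ "Unhealthy",
      TRUE                            ~ "Mixed"
    )
  )
```

A socio-economic class variable could be derived the same way with `case_when()` over income, education, and healthcare columns.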

Data Analysis

Correlation

  • Between all variables: Health-related variables are correlated with one another. No pair of variables is strongly negatively correlated.

  • With the target variable: GenHlth, HighBP, and BMI are the variables most correlated with the diabetes variable.
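Both correlation views can be sketched as follows, on simulated data. The target name Diabetes_binary and the simulated effect sizes are assumptions.

```r
library(dplyr)
library(tidyr)

set.seed(1)
n <- 200

# Simulated stand-in for the cleaned data (hypothetical effects)
toy <- tibble(
  BMI    = rnorm(n, mean = 28, sd = 5),
  HighBP = rbinom(n, 1, 0.4),
  Fruits = rbinom(n, 1, 0.6)
) |>
  mutate(Diabetes_binary = rbinom(n, 1, plogis(-6 + 0.15 * BMI + 1.0 * HighBP)))

# Correlation between all variables (Pearson by default)
cor_matrix <- cor(toy)

# Correlation of each predictor with the target, strongest first
cor_with_target <- toy |>
  summarise(across(-Diabetes_binary, ~ cor(.x, Diabetes_binary))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "correlation") |>
  arrange(desc(abs(correlation)))
```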

GLM

  • All variables: A GLM fitted with all numerical variables.

  • Stepwise selection: Forward and backward steps to select the best variables.

  • Results: Lowest AIC achieved with the backward model (contains 19 variables).
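A sketch of the stepwise comparison using base R's `step()`. The binomial family, the binary target, and the toy predictors are assumptions; the deck does not state the family used.

```r
set.seed(42)
n <- 500

# Simulated data: two informative predictors and one pure-noise column
toy <- data.frame(
  BMI    = rnorm(n, mean = 28, sd = 5),
  HighBP = rbinom(n, 1, 0.4),
  Noise  = rnorm(n)
)
toy$Diabetes_binary <- rbinom(n, 1, plogis(-5 + 0.12 * toy$BMI + 1.0 * toy$HighBP))

# Full model (all predictors) and null model (intercept only)
full_model <- glm(Diabetes_binary ~ BMI + HighBP + Noise, data = toy, family = binomial)
null_model <- glm(Diabetes_binary ~ 1, data = toy, family = binomial)

# Backward: start from the full model and drop terms while AIC improves
backward <- step(full_model, direction = "backward", trace = 0)

# Forward: start from the null model and add terms from the given scope
forward <- step(
  null_model,
  direction = "forward",
  scope = list(lower = ~ 1, upper = ~ BMI + HighBP + Noise),
  trace = 0
)

# Side-by-side AIC for the two selected models
aic_table <- AIC(backward, forward)
```

The model with the lower value in `aic_table` is the one preferred by this criterion.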

Data Analysis

PCA + Logistic Regression

  • Selected components: 15 components, which together reach 80% of explained variance.

  • Logistic regression: Those components are used as predictors in a diabetes prediction model.

  • Results: Accuracy of 87%.
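The PCA and logistic regression pipeline can be sketched like this on simulated data. The 80% threshold comes from the slide; the 0.5 classification cutoff and the in-sample accuracy are assumptions, since the deck does not say how accuracy was evaluated.

```r
set.seed(7)
n <- 400

# Simulated predictors; x5 is deliberately correlated with x1
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
X$x5 <- X$x1 + rnorm(n, sd = 0.3)
y <- rbinom(n, 1, plogis(1.2 * X$x1 - 0.8 * X$x3))

# PCA on centred, scaled predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained; keep the fewest components reaching 80%
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(explained >= 0.80)[1]

# Logistic regression on the retained component scores
scores <- as.data.frame(pca$x[, seq_len(k), drop = FALSE])
scores$y <- y
fit <- glm(y ~ ., data = scores, family = binomial)

# In-sample accuracy with an assumed 0.5 cutoff
pred <- as.integer(predict(fit, type = "response") > 0.5)
accuracy <- mean(pred == scores$y)
```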

New GLMs

  • Men vs. women: Two separate datasets created according to sex.

  • Results: The men's model performs better, judged by its lower AIC. General health variables and the fruit variable gain importance. Much better performance than the GLM from the first part of the analysis.
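One way to fit and compare the per-sex models is a nested tidyverse workflow with `broom::glance()`. The data, the Sex labels, the predictors, and the binomial family are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(3)
n <- 600

# Simulated stand-in for the cleaned data (hypothetical effects)
toy <- tibble(
  Sex    = sample(c("Men", "Women"), n, replace = TRUE),
  BMI    = rnorm(n, mean = 28, sd = 5),
  HighBP = rbinom(n, 1, 0.4)
) |>
  mutate(Diabetes_binary = rbinom(n, 1, plogis(-5 + 0.12 * BMI + 1.0 * HighBP)))

models <- toy |>
  group_by(Sex) |>
  nest() |>                                   # one row per sex, data in a list-column
  mutate(
    fit   = map(data, ~ glm(Diabetes_binary ~ BMI + HighBP,
                            data = .x, family = binomial)),
    stats = map(fit, glance)                  # one-row model summary incl. AIC
  ) |>
  unnest(stats) |>
  ungroup() |>
  select(Sex, AIC, nobs)
```

AIC depends on the number of observations, so the `nobs` column is worth reading alongside it when the two subsets differ in size.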

Results

Data description (show the plots and explain what insights they give us, maybe even answer some research questions)

Results

Analysis part 1 results (same as part above)

Results

Analysis part 2 results (same as part above)

Discussion

Discussion and key takeaways